Video games recommendation system¶

The aim of this notebook is to create a recommendation system that suggests products similar to the one the user chose.

The iterations of the project will be kept in a git repository

  • git link - https://git.fhict.nl/I509460/video-game-reommendation.git

The project was created and is worked on by Mihail Kenarov

Introduction¶

  1. At the beginning of the semester we were introduced to the first steps of what AI is about and how it works. We were given the opportunity to create a project of our liking, with the first submission, known as Iteration 0, being the point where we got feedback on our initial ideas. For that submission I had only selected a dataset, which was full of empty values that I did not know what to do with, and mentioned the idea of using a kNN model, because to me it seemed like a matter of classification.

  2. For the previous iteration (Iteration 1) of the notebook, I created a recommendation system that took the descriptions of the games and put them through the TF*IDF (Term Frequency-Inverse Document Frequency) model. After that, the sigmoid kernel was used to compare all of the vectorized games by mapping them to a range between 0 and 1. Finally, it printed out the top 5 games closest to the one we selected.

  • link for understanding TF*IDF (https://www.youtube.com/watch?v=D2V1okCEsiE&ab_channel=KrishNaik)
  3. For this iteration (Iteration 2) I am going to use another model and implement more data cleaning as well as more preprocessing. While doing the modeling I will try to implement graphs which will give me a better representation of what is currently going on with the system. Finally, I will write a conclusion about the differences between the 2 tries.
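
To make the Iteration 1 approach concrete, here is a minimal sketch of the TF*IDF plus sigmoid kernel idea on a handful of invented descriptions. The game titles, the texts and the helper `top_similar` are assumptions made for illustration only; this is not the original Iteration 1 code.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

# Toy catalogue: titles and descriptions are made up for this sketch.
toy_games = {
    "Space Blaster": "shoot aliens in space with your ship",
    "Galaxy Shooter": "pilot a ship and shoot aliens across the galaxy",
    "Farm Life": "grow crops and raise animals on a quiet farm",
    "Harvest Valley": "plant crops, raise animals and build your farm",
}
names = list(toy_games)

# Turn every description into a TF*IDF vector.
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(list(toy_games.values()))

# Pairwise similarity scores between all descriptions.
scores = sigmoid_kernel(matrix, matrix)


def top_similar(name, n=2):
    """Return the n titles whose descriptions score highest against `name`."""
    idx = names.index(name)
    order = np.argsort(scores[idx])[::-1]  # highest score first
    return [names[i] for i in order if i != idx][:n]


print(top_similar("Space Blaster", n=1))
print(top_similar("Farm Life", n=1))
```

On the real dataset the same steps would run on the description column, with the top 5 titles returned instead of 1.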

Importing libraries¶

In [ ]:
import sklearn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

print("scikit-learn version:", sklearn.__version__) # 1.4.1
print("pandas version:", pd.__version__)            # 2.2.1
print("seaborn version:", sns.__version__)          # 0.13.2
scikit-learn version: 1.4.1.post1
pandas version: 2.2.2
seaborn version: 0.13.2

Phase 1¶

Domain Understanding¶

There are many things that should be understood when it comes to the creation of a recommendation system, but let us have a quick look into the basics of the subject.

What are recommendation systems¶

A recommendation system, also known as a recommender system, is a subset of machine learning. It leverages data to assist in predicting, refining, and identifying what individuals are seeking amidst an exponentially expanding array of choices.

What types of recommendation systems are there¶

  1. Collaborative filtering - it is based on the user data and the item data that we have. It comes in 2 different categories:
  • User - User Collaborative filtering - User-Based Collaborative Filtering is a method employed to anticipate the preferences of a user by considering the ratings provided to items by other users with similar tastes to the target user. This technique is commonly utilized by numerous websites to construct their recommendation systems.

  • Item - Item Collaborative filtering - Rather than matching the user to similar customers, item-to-item collaborative filtering matches each of the user’s purchased and rated items to similar items, then combines those similar items into a recommendation list.

  2. Content based filtering - it is based mainly on the data that we have about the items the users can interact with. For example, if they buy a certain book, other books with a similar genre, author, type etc. can be recommended.

  3. Hybrid recommendation system - This kind of system combines both of the previously mentioned approaches. It merges the inputs from collaborative filtering and content based filtering so that the overall accuracy is better. It also mitigates problems such as the 'cold start', where the lack of data is a problem at the beginning, right after deployment.
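
As a small illustration of content based filtering, the sketch below recommends games purely from the overlap of their genre sets, measured with the Jaccard index. The catalogue and the function names `jaccard` and `recommend` are hypothetical; a real system would use richer item data than genres alone.

```python
def jaccard(a, b):
    """Share of genres two games have in common (0.0 = none, 1.0 = identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy catalogue, invented for this sketch.
catalog = {
    "Game A": {"Puzzle", "Strategy"},
    "Game B": {"Puzzle", "Tactical"},
    "Game C": {"Sport"},
    "Game D": {"Puzzle", "Strategy", "Indie"},
}


def recommend(title, catalog, n=2):
    """Return the n games whose genres overlap most with `title`."""
    target = catalog[title]
    scored = [
        (jaccard(target, genres), other)
        for other, genres in catalog.items()
        if other != title
    ]
    # Highest score first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:n]]


print(recommend("Game A", catalog))
```

No user data is involved here at all, which is why this kind of approach still works when a system has just been deployed.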


Links for more detailed information:
  • User - User Collaborative filtering https://www.geeksforgeeks.org/user-based-collaborative-filtering/

  • Item - Item Collaborative filtering https://www.geeksforgeeks.org/item-to-item-based-collaborative-filtering/?ref=ml_lbp

  • Types of Recommendation systems https://marutitech.medium.com/what-are-the-types-of-recommendation-systems-3487cbafa7c9

What are some of the more famous algorithms used for such systems¶

  1. Matrix Factorisation - Matrix factorization represents a category of collaborative filtering techniques utilized in recommendation systems. These algorithms function by breaking down the user-item interaction matrix into the product of two lower-dimensional rectangular matrices.

  2. Nearest Neighbors (kNN) - The simplest algorithm computes cosine or correlation similarity of rows (users) or columns (items) and recommends items that the k nearest neighbors enjoyed.

  3. TF*IDF - Term Frequency-Inverse Document Frequency, abbreviated as TF-IDF, is a metric that quantifies the significance of a word within a document in a collection or corpus, taking into account the adjustment for words that generally appear more frequently. It has been commonly utilized as a weighting factor in information retrieval, text mining, and user modeling searches.
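
The nearest neighbors idea from the list above can be sketched with plain NumPy: compute the cosine similarity between the item columns of a small user-item matrix and pick the closest items. The ratings, item names and the helper functions are invented for illustration and are not part of the dataset used later.

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated" (toy numbers).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)
items = ["Game A", "Game B", "Game C", "Game D"]


def cosine_similarity_columns(m):
    """Cosine similarity between every pair of columns of m."""
    norms = np.linalg.norm(m, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for unrated items
    unit = m / norms
    return unit.T @ unit


def nearest_items(item, k=1):
    """Return the k items whose rating columns are most similar to `item`."""
    sim = cosine_similarity_columns(ratings)
    idx = items.index(item)
    order = np.argsort(-sim[idx])  # most similar first
    return [items[i] for i in order if i != idx][:k]


print(nearest_items("Game A"))
print(nearest_items("Game C"))
```

Treating the rows instead of the columns as vectors would give the user-user variant of the same technique.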


  • kNN and Matrix Factorization https://medium.com/recombee-blog/machine-learning-for-recommender-systems-part-1-algorithms-evaluation-and-cold-start-6f696683d0ed

  • TF*IDF https://en.wikipedia.org/wiki/Tf%E2%80%93idf

What about some history of the recommendation systems¶

Can you guess which was the first recommendation system ever created? It was you! Recommendation systems have been with us since the beginning of human history. It started with us – the humans, sharing ideas with friends, family, or people we just enjoy being with, about things we think go well together or that we would like the people close to us to experience. These were the first recommendations ever given, and we still use them to this day.

With the evolution of technology, we eventually got the first recommendation system that was made by humans and operated on its own – “Grundy.” It was a system that recommended books based on the users’ inputs. With time it started being criticized, like all things in our world, especially in technology.

  • More about 'Grundy' and the history of recommendation systems https://onespire.net/history-of-recommender-systems/

Want to know more about recommendation systems?¶

If you are interested in getting a deeper understanding of the questions we just discussed, or if you have a deep interest in the world of recommendation systems and want to know more about topics such as:

- Pros, Cons, Ethical Problems and Limitations of Recommendation systems

- Who is most affected by the usage of such systems and in what ways

- Where are they implemented and in what ways

Feel free to have a look into the Project Proposal that was attached to the submission, together with this notebook

Phase 2¶

Data Requirements:¶

Considering the fact that video games have been with us for quite a while now, there are some things that we should take into consideration when tackling the project at hand. The most important of these is the data we are going to select for this project. Knowing that people have different tastes, we will definitely need the genres. Not only that but, let us be honest, if we are going to create a recommendation system we will need the names. A developer might be helpful, and possibly the description. Some people will be interested in older games and some in newer ones, so it might not be a bad idea to have the date of release and possibly who released it. One big part of recommendations is the ratings given to the products, which in our case are the games.

For now what we know we would want:

- Name : Text
- Genre : Text
- Description : Text
- Developer : Text 
- Rating : Number 
- Date of release: Date Time format

There might be some other features that could be of use, but for now these are the main things that we are going to be looking for

Data collection:¶

After looking into multiple places where one can gather data for such a project, I have decided to get this one from Kaggle

https://www.kaggle.com/datasets/gsimonx37/backloggd

It has been collected from this site: https://www.backloggd.com/, which to me seems like a good site created by fans of video games. It lists different games, their genres and ratings from users, and allows the users to express their opinions on certain games

The data in it seems to fit the criteria of what we may need to use.

Let's have a closer look into the datasets we are provided¶

In [ ]:
import sklearn 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

games_df = pd.read_csv('games.csv')
devs_df = pd.read_csv('developers.csv')
genres_df = pd.read_csv('genres.csv')
platforms_df = pd.read_csv('platforms.csv')
scores_df = pd.read_csv('scores.csv')

Let's start one by one¶

Games dataframe

In [ ]:
games_df.head()
Out[ ]:
id name date rating reviews plays playing backlogs wishlists description
0 1000001 Cathode Ray Tube Amusement Device 1947-12-31 3.5 65.0 117.0 1.0 28.0 56.0 The cathode ray tube amusement device is the e...
1 1000002 Bertie the Brain 1950-08-25 2.5 11.0 24.0 0.0 6.0 12.0 Currently considered the first videogame in hi...
2 1000003 Nim 1951-12-31 1.8 2.0 11.0 0.0 2.0 6.0 The Nimrod was a special purpose computer that...
3 1000004 Draughts 1952-08-31 2.4 3.0 17.0 0.0 3.0 7.0 A game of draughts (a.k.a. checkers) written f...
4 1000005 OXO 1952-12-31 3.1 14.0 52.0 1.0 12.0 13.0 OXO was a computer game developed by Alexander...
In [ ]:
games_df.shape
Out[ ]:
(172512, 10)

Devs dataframe

In [ ]:
devs_df.head()
Out[ ]:
id developer
0 1000002 Josef Kates
1 1000004 Christopher Strachey
2 1000005 Alexander Shafto "Sandy" Douglas
3 1000005 University of Warwick
4 1000007 William Higinbotham
In [ ]:
devs_df.shape
Out[ ]:
(143454, 2)

Genres Dataframe

In [ ]:
genres_df.head()
Out[ ]:
id genre
0 1000001 Point-and-Click
1 1000002 Puzzle
2 1000002 Tactical
3 1000003 Pinball
4 1000003 Strategy
In [ ]:
genres_df.shape
Out[ ]:
(286025, 2)

Platforms Dataset

In [ ]:
platforms_df.head()
Out[ ]:
id platform
0 1000001 Analogue electronics
1 1000002 Arcade
2 1000003 Ferranti Nimrod Computer
3 1000004 Legacy Computer
4 1000005 Windows PC
In [ ]:
platforms_df.shape
Out[ ]:
(261475, 2)

Scores Dataset

In [ ]:
scores_df.head(15)
Out[ ]:
id score amount
0 1000001 0.5 10
1 1000001 1.0 5
2 1000001 1.5 1
3 1000001 2.0 3
4 1000001 2.5 9
5 1000001 3.0 10
6 1000001 3.5 2
7 1000001 4.0 2
8 1000001 4.5 3
9 1000001 5.0 41
10 1000002 0.5 0
11 1000002 1.0 3
12 1000002 1.5 0
13 1000002 2.0 4
14 1000002 2.5 2
In [ ]:
scores_df.shape
Out[ ]:
(1725120, 3)

Data Understanding:¶

Here is what we can gather from the information¶

  • The games dataset has 172512 rows and 10 columns

  • The developers dataset has 143454 rows and 2 columns

  • The genres dataset has 286025 rows and 2 columns

  • The platforms dataset has 261475 rows and 2 columns

  • The scores dataset has 1725120 rows and 3 columns

From what we are shown we can also create a dictionary that gives us a basis for what we are working with and a good overall look at what data is contained in which columns

Data Dictionary¶

  1. Games dataset - basic data:
  • id - video game identifier (primary key);
  • name - name of the video game;
  • date - release date of the video game;
  • rating - average rating of the video game;
  • reviews - number of reviews;
  • plays - total number of players;
  • playing - number of players currently (at the time);
  • backlogs - the number of additions of a video game to the backlog;
  • wishlists - the number of times a video game has been added to “wishlist” (want to buy);
  • description - description of the video game.
  2. Developers dataset - developers (publishers):
  • id - video game identifier (foreign key);
  • developer - developer (publisher) of a video game.
  3. Platforms dataset - platforms of the games:
  • id - video game identifier (foreign key);
  • platform - gaming platform.
  4. Genres dataset - game genres:
  • id - video game identifier (foreign key);
  • genre - video game genre.
  5. Scores dataset - user ratings:
  • id - video game identifier (foreign key);
  • score - score (from 0.5 to 5 in increments of 0.5);
  • amount - number of users that gave this score.
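
Since the dictionary describes `id` as a primary key in the games table and a foreign key everywhere else, it can be worth checking that relationship before merging. The following is a minimal sketch on toy frames; the helper `check_keys` and the sample rows are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Toy stand-ins for games_df (parent) and genres_df (child).
games = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
genres = pd.DataFrame({
    "id": [1, 1, 2, 4],
    "genre": ["Puzzle", "Strategy", "Sport", "Indie"],
})


def check_keys(parent, child, key="id"):
    """Summarise how well a child table's foreign key matches the parent."""
    return {
        # A primary key must not repeat in the parent table.
        "parent_key_unique": bool(parent[key].is_unique),
        # Child rows that point at an id the parent does not contain.
        "orphan_child_rows": int((~child[key].isin(parent[key])).sum()),
        # Parent rows that have no matching row in the child table.
        "parents_without_children": int((~parent[key].isin(child[key])).sum()),
    }


report = check_keys(games, genres)
print(report)
```

Running the same helper as `check_keys(games_df, genres_df)` would show how many games have no genre at all, which matters once the tables are joined.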

Let's dive deeper into the main dataset that we are going to use: games_df

In [ ]:
games_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 172512 entries, 0 to 172511
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   id           172512 non-null  int64  
 1   name         172512 non-null  object 
 2   date         137731 non-null  object 
 3   rating       55569 non-null   float64
 4   reviews      172511 non-null  float64
 5   plays        171818 non-null  float64
 6   playing      171818 non-null  float64
 7   backlogs     171818 non-null  float64
 8   wishlists    171818 non-null  float64
 9   description  153588 non-null  object 
dtypes: float64(6), int64(1), object(3)
memory usage: 13.2+ MB
In [ ]:
games_df.isnull().sum()
Out[ ]:
id                  0
name                0
date            34781
rating         116943
reviews             1
plays             694
playing           694
backlogs          694
wishlists         694
description     18924
dtype: int64
In [ ]:
games_df.head()
Out[ ]:
id name date rating reviews plays playing backlogs wishlists description
0 1000001 Cathode Ray Tube Amusement Device 1947-12-31 3.5 65.0 117.0 1.0 28.0 56.0 The cathode ray tube amusement device is the e...
1 1000002 Bertie the Brain 1950-08-25 2.5 11.0 24.0 0.0 6.0 12.0 Currently considered the first videogame in hi...
2 1000003 Nim 1951-12-31 1.8 2.0 11.0 0.0 2.0 6.0 The Nimrod was a special purpose computer that...
3 1000004 Draughts 1952-08-31 2.4 3.0 17.0 0.0 3.0 7.0 A game of draughts (a.k.a. checkers) written f...
4 1000005 OXO 1952-12-31 3.1 14.0 52.0 1.0 12.0 13.0 OXO was a computer game developed by Alexander...

NB: We do see that a lot of the ratings are missing, but let's continue on, while keeping this in mind
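
According to the counts above, 116943 of the 172512 ratings are missing, which is roughly two thirds of the column. A quick way to see such shares for every column at once is to average the missing-value mask; the sketch below shows this on a toy frame whose values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: 3 of 4 ratings and 1 of 4 descriptions are missing.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "rating": [3.5, np.nan, np.nan, np.nan],
    "description": ["x", "y", None, "z"],
})

# isna() gives True/False per cell; the mean of that is the missing share.
missing_share = (
    toy.isna().mean().mul(100).round(1).sort_values(ascending=False)
)
print(missing_share)
```

Applied to the real data this would be `games_df.isna().mean()`, which reads more easily than raw counts when the columns are compared with each other.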

We also see that there is probably a good correlation between plays, playing, backlogs and wishlists

While doing so we can also visualise some other curiosities, like when most of the games in the dataset were made

In [ ]:
# Ensure the 'date' column is in datetime format
#games_df['date'] = pd.to_datetime(games_df['date'])

# Extract the year from the date
#games_df['year'] = games_df['date'].dt.year

# Count the number of games released each year
#games_per_year = games_df['year'].value_counts().sort_index()

# Plot the counts
#plt.figure(figsize=(10, 6))
#games_per_year.plot(kind='bar')
#plt.title('Number of Games Released by Year')
#plt.xlabel('Year')
#plt.ylabel('Number of Games Released')
#plt.show()

After trying to run this code I ran into an error that suggested problematic formatting of the years; specifically, there was a game registered with the date 6969-06-09, at position 12171. We can take a look into that later as well, however I am still interested in a more accurate view of when most games were released
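
The likely reason for the error is that pandas' default nanosecond timestamps cannot represent dates beyond roughly the year 2262, so a year like 6969 is out of range. One way to find such rows without parsing the dates at all is to read the year straight from the text. This is a sketch on toy data; the cut-off `MAX_PLAUSIBLE_YEAR` is an assumed value, not something taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the raw (still textual) 'date' column.
raw_dates = pd.Series(
    ["1947-12-31", "2024-03-31", "6969-06-09", None], name="date"
)

MAX_PLAUSIBLE_YEAR = 2100  # assumed cut-off for "obviously wrong" years

# The first four characters of an ISO date are the year.
years = pd.to_numeric(raw_dates.str.slice(0, 4), errors="coerce")

# Missing dates are not suspicious, just missing, so treat them as year 0.
suspicious = raw_dates[years.fillna(0) > MAX_PLAUSIBLE_YEAR]
print(suspicious)
```

The index of `suspicious` points directly at the rows that need a closer look, which avoids searching for one specific date string by hand.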

In [ ]:
problematic_row = games_df[games_df['date'] == '6969-06-09']

problematic_row
Out[ ]:
id name date rating reviews plays playing backlogs wishlists description
137716 1137717 The Mysterious Cat Tower 6969-06-09 NaN 0.0 0.0 0.0 0.0 1.0 A turn-based J/W/GRPG. Find spells to add to y...

After looking up The Mysterious Cat Tower, it turns out that it is actually a game that is planned for that given year...

  • link to the game https://store.steampowered.com/app/1706960/The_Mysterious_Cat_Tower/

Taking this into consideration... I guess the best thing to do is remove it, because it will be useful neither for the creation of our system nor for a user who wants to try it

In [ ]:
# Get the index of the problematic row
problematic_index = problematic_row.index

# Drop the problematic row
games_df = games_df.drop(problematic_index)

Now we should be able to see the earliest date given for a game and the latest date on which a game is planned to be released

In [ ]:
# Convert 'date' column to datetime, coercing errors to NaT
games_df['date'] = pd.to_datetime(games_df['date'], errors='coerce')

# Get the earliest (min) and latest (max) dates
min_date = games_df['date'].min()
max_date = games_df['date'].max()

print(f"Earliest date: {min_date}")
print(f"Latest date: {max_date}")


# Extract the year from the date
games_df['year'] = games_df['date'].dt.year

# Count the number of games released each year
games_per_year = games_df['year'].value_counts().sort_index()

# Plot the counts
plt.figure(figsize=(12, 6))
games_per_year.plot(kind='bar')
plt.title('Number of Games Released by Year')
plt.xlabel('Year')
plt.ylabel('Number of Games Released')
plt.show()
Earliest date: 1947-12-31 00:00:00
Latest date: 2030-12-20 00:00:00
[Figure: bar chart "Number of Games Released by Year" (x: Year, y: Number of Games Released)]

I currently have my doubts about the usefulness of the rows where the date is not given, but I also do not know if there is a correlation between the date and something else. For now, let us leave it like this again and start looking into the rest of the features and datasets

In [ ]:
# Select only the numerical columns
numerical_games_df = games_df.select_dtypes(include=['int64', 'float64'])

# Compute the correlation matrix
corr_matrix = numerical_games_df.corr()

# Create a correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
[Figure: "Correlation Heatmap" of the numerical columns]

As expected, the reviews, plays, playing, backlogs and wishlists features do seem to have quite a good amount of correlation with each other
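
Two small refinements might make such a correlation check easier to read: leaving out the `id` column, which is only an identifier, and also looking at a rank-based correlation, since counts like plays and wishlists tend to be dominated by a few very popular games. The sketch below uses invented numbers; the Pearson correlation of the ranks is the Spearman correlation.

```python
import pandas as pd

# Toy frame with one identifier and two heavily skewed count columns.
toy = pd.DataFrame({
    "id": [1000001, 1000002, 1000003, 1000004],
    "plays": [10, 200, 3000, 40000],
    "wishlists": [1, 15, 90, 2000],
})

# The identifier carries no meaning as a number, so drop it first.
features = toy.drop(columns=["id"])

pearson = features.corr()          # linear relationship
spearman = features.rank().corr()  # monotonic relationship (on ranks)

print(pearson.round(2))
print(spearman.round(2))
```

If the rank-based values are clearly higher than the linear ones, the relationship is probably monotonic but not linear.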

In [ ]:
# Select the columns to plot
cols = ['reviews', 'plays', 'playing', 'backlogs', 'wishlists', 'year']
subset_df = games_df[cols]

# Create a scatter plot matrix
sns.pairplot(subset_df)
plt.show()
[Figure: scatter plot matrix of reviews, plays, playing, backlogs, wishlists and year]

Observation from the scatterplot:¶

We do not see a specific pattern that we can work with, however we will probably get back to these problems soon.

Genres dataset¶

Let us take a look into the genres dataset and see what we can do there

In [ ]:
print(genres_df.nunique()) 
genres_df.head(15)
id       147369
genre        23
dtype: int64
Out[ ]:
id genre
0 1000001 Point-and-Click
1 1000002 Puzzle
2 1000002 Tactical
3 1000003 Pinball
4 1000003 Strategy
5 1000004 Card & Board Game
6 1000005 Puzzle
7 1000005 Strategy
8 1000006 Sport
9 1000007 Arcade
10 1000007 Sport
11 1000008 Simulator
12 1000009 Shooter
13 1000009 Simulator
14 1000010 Strategy

From what we see it does seem like it would be a wise idea to proceed in this direction:

1. Put all of the genres of a game to be on the same row

2. Encode it so that each genre becomes a binary yes/no column

Putting the genres of a game in the same row:

In [ ]:
# Convert the 'genre' column to string
genres_df['genre'] = genres_df['genre'].astype(str)

# Group by 'id' and join the genres into a single string
genres_df = genres_df.groupby('id')['genre'].apply(', '.join).reset_index()
In [ ]:
genres_df.head(10)
Out[ ]:
id genre
0 1000001 Point-and-Click
1 1000002 Puzzle, Tactical
2 1000003 Pinball, Strategy
3 1000004 Card & Board Game
4 1000005 Puzzle, Strategy
5 1000006 Sport
6 1000007 Arcade, Sport
7 1000008 Simulator
8 1000009 Shooter, Simulator
9 1000010 Strategy

Performing one-hot encoding to have the genres "vectorised"

In [ ]:
# Perform one-hot encoding
genres_df_encoded = genres_df['genre'].str.get_dummies(sep=', ')

# Join the encoded genres back to the 'id' column
genres_df_encoded = pd.concat([genres_df['id'], genres_df_encoded], axis=1)

genres_df_encoded.head(10)
Out[ ]:
id Adventure Arcade Brawler Card & Board Game Fighting Indie MOBA Music Pinball ... RPG Racing Real Time Strategy Shooter Simulator Sport Strategy Tactical Turn Based Strategy Visual Novel
0 1000001 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1000002 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
2 1000003 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 1 0 0 0
3 1000004 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 1000005 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
5 1000006 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
6 1000007 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
7 1000008 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
8 1000009 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 1 0 0 0 0 0
9 1000010 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0

10 rows × 24 columns
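
The same kind of table can also be built directly from the long format (one row per game and genre), without first joining the genres into a string. This is an alternative sketch using `pd.crosstab` on toy rows copied from the head of the genres table; it is not the method used in the cells above.

```python
import pandas as pd

# Long format: one row per (game, genre) pair, as in the original genres file.
genres_long = pd.DataFrame({
    "id": [1000001, 1000002, 1000002, 1000003, 1000003],
    "genre": ["Point-and-Click", "Puzzle", "Tactical", "Pinball", "Strategy"],
})

# Count (id, genre) pairs, cap at 1 in case a pair is listed twice,
# then turn the id index back into a normal column.
encoded = (
    pd.crosstab(genres_long["id"], genres_long["genre"])
    .clip(upper=1)
    .reset_index()
)
encoded.columns.name = None

print(encoded)
```

One possible advantage is that it does not depend on the `', '` separator, so a genre name containing a comma would not be split by accident.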

What would be, let's say, the top 10 most popular genres?

In [ ]:
# Convert genre names to lowercase, split on comma, strip whitespaces from each genre name, and count the occurrences
genre_counts = genres_df['genre'].str.lower().str.split(',').apply(lambda x: [i.strip() for i in x]).explode().value_counts()

# Select the top 10 genres
top_10_genres = genre_counts.head(10)

# Print the top 10 genres
print(top_10_genres)

# Create a bar plot of the top 10 genres
plt.figure(figsize=(10, 6))
sns.barplot(x=top_10_genres.index, y=top_10_genres.values)
plt.title('Top 10 Most Popular Genres')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()
genre
indie        50501
adventure    49653
simulator    22828
rpg          22320
strategy     21701
shooter      18542
puzzle       17496
arcade       14872
platform     14025
sport        10407
Name: count, dtype: int64
[Figure: bar chart "Top 10 Most Popular Genres" (x: Genre, y: Count)]

As expected, many of the games are Adventure games, and it would seem that we also have quite a lot of Indie games

Now let us combine what we have done until now¶

In [ ]:
# Merge the two DataFrames on 'id'
combined_df = pd.merge(games_df, genres_df_encoded, on='id', how='inner')
combined_df_no_encoding =  pd.merge(games_df, genres_df, on='id', how='inner')
In [ ]:
print(combined_df.shape)
combined_df.head()
(147368, 34)
Out[ ]:
id name date rating reviews plays playing backlogs wishlists description ... RPG Racing Real Time Strategy Shooter Simulator Sport Strategy Tactical Turn Based Strategy Visual Novel
0 1000001 Cathode Ray Tube Amusement Device 1947-12-31 3.5 65.0 117.0 1.0 28.0 56.0 The cathode ray tube amusement device is the e... ... 0 0 0 0 0 0 0 0 0 0
1 1000002 Bertie the Brain 1950-08-25 2.5 11.0 24.0 0.0 6.0 12.0 Currently considered the first videogame in hi... ... 0 0 0 0 0 0 0 1 0 0
2 1000003 Nim 1951-12-31 1.8 2.0 11.0 0.0 2.0 6.0 The Nimrod was a special purpose computer that... ... 0 0 0 0 0 0 1 0 0 0
3 1000004 Draughts 1952-08-31 2.4 3.0 17.0 0.0 3.0 7.0 A game of draughts (a.k.a. checkers) written f... ... 0 0 0 0 0 0 0 0 0 0
4 1000005 OXO 1952-12-31 3.1 14.0 52.0 1.0 12.0 13.0 OXO was a computer game developed by Alexander... ... 0 0 0 0 0 0 1 0 0 0

5 rows × 34 columns
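
Comparing the shapes, the inner merge keeps 147368 of the 172511 remaining games, so 25143 games without any genre are silently dropped. A left merge with `indicator=True` makes it possible to inspect exactly which rows those are. The frames below are toy data invented for the sketch.

```python
import pandas as pd

# Toy stand-ins for games_df and genres_df_encoded.
games = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["A", "B", "C", "D"]})
genres_encoded = pd.DataFrame({"id": [1, 2, 4], "Puzzle": [1, 0, 1]})

# A left merge keeps every game; '_merge' records where each row was found.
audit = games.merge(genres_encoded, on="id", how="left", indicator=True)

# Rows found only in the left table are the ones an inner merge would drop.
dropped_by_inner = audit[audit["_merge"] == "left_only"]
n_dropped = len(dropped_by_inner)

print(n_dropped)
print(dropped_by_inner["name"].tolist())
```

Whether dropping those games is acceptable depends on the model: a genre-based recommender cannot use them anyway, while a description-based one could.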

I do, however, wonder whether it is possible to use the genres to find the years, but I am still not sure if that is possible¶

In [ ]:
# Convert 'date' to datetime
combined_df['date'] = pd.to_datetime(combined_df['date'])

# List of genres
genres = ['Adventure', 'Arcade', 'Brawler', 'Card & Board Game', 'Fighting', 'Indie', 'MOBA', 'Music',
          'Pinball', 'Platform', 'Point-and-Click', 'Puzzle', 'Quiz/Trivia', 'RPG', 'Racing', 
          'Real Time Strategy', 'Shooter', 'Simulator', 'Sport', 'Strategy', 'Tactical', 
          'Turn Based Strategy', 'Visual Novel']

# Resample the data by month end and count the number of games for each genre
# ('ME' replaces the deprecated 'M' alias in pandas 2.2 and later)
monthly_counts = combined_df.resample('ME', on='date')[genres].sum()

# Create a line plot for each genre
plt.figure(figsize=(24, 8))
for genre in genres:
    plt.plot(monthly_counts.index, monthly_counts[genre], label=genre)
plt.title('Number of Games Over Time by Genre')
plt.xlabel('Date')
plt.ylabel('Count')
plt.legend(loc='upper left', bbox_to_anchor=(1,1))
plt.show()
[Figure: line plot "Number of Games Over Time by Genre" (x: Date, y: Count)]
In [ ]:
# Create a stacked area plot for each genre
plt.figure(figsize=(18, 6))
plt.stackplot(monthly_counts.index, monthly_counts[genres].T)
plt.title('Number of Games Over Time by Genre')
plt.xlabel('Date')
plt.ylabel('Count')
plt.legend(genres, loc='upper left', bbox_to_anchor=(1,1))
plt.show()
[Figure: stacked area plot "Number of Games Over Time by Genre" (x: Date, y: Count)]
In [ ]:
# Create a scatter plot for each genre
fig, axs = plt.subplots(len(genres), figsize=(10, 6*len(genres)))
for i, genre in enumerate(genres):
    axs[i].scatter(monthly_counts.index, monthly_counts[genre])
    axs[i].set_title('Number of ' + genre + ' Games Over Time')
    axs[i].set_xlabel('Date')
    axs[i].set_ylabel('Count')
plt.tight_layout()
plt.show()
[Figure: one scatter plot per genre of the number of games over time (x: Date, y: Count)]
In [ ]:
# Reshape the DataFrame
reshaped_df = monthly_counts.reset_index().melt(id_vars='date', var_name='genre', value_name='count')

# Create a FacetGrid
g = sns.FacetGrid(reshaped_df, col='genre', col_wrap=5, height=4)
g = g.map(plt.plot, 'date', 'count')
plt.show()
[Figure: grid of line plots of monthly game counts, one panel per genre]

It does not really seem possible, so I will leave it like that for now. Let us continue looking into the empty spots¶

In [ ]:
games_df.isna().sum()
Out[ ]:
id                  0
name                0
date            34781
rating         116942
reviews             1
plays             694
playing           694
backlogs          694
wishlists         694
description     18924
year            34781
dtype: int64
In [ ]:
# Convert 'date' to datetime if it's not already
games_df['date'] = pd.to_datetime(games_df['date'])

# Extract the year from the date
games_df['release_year'] = games_df['date'].dt.year

# Calculate the average rating per year
average_rating_per_year = games_df.groupby('release_year')['rating'].mean()
In [ ]:
# Print the average rating per year
print(average_rating_per_year)
release_year
1947.0    3.500000
1950.0    2.500000
1951.0    1.800000
1952.0    2.750000
1954.0    3.000000
            ...   
2025.0    1.966667
2026.0         NaN
2027.0         NaN
2029.0         NaN
2030.0         NaN
Name: rating, Length: 72, dtype: float64

It is understandable that games which will be released in future years still do not have a rating. The cleaning will continue. Now we will turn our attention to reviews
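
A yearly average says little when it is based on only one or two rated games, as is probably the case for the earliest years. Showing the number of ratings next to the mean makes that visible. The numbers in this sketch are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one old game, three games from 2023, one unreleased game.
toy = pd.DataFrame({
    "release_year": [1947, 2023, 2023, 2023, 2026],
    "rating": [3.5, 4.0, 3.0, np.nan, np.nan],
})

# 'count' only counts non-missing ratings, so it shows how much data
# stands behind each yearly mean.
per_year = toy.groupby("release_year")["rating"].agg(["mean", "count"])
print(per_year)
```

Years with a count of 0 end up with a NaN mean, which matches what the future years show in the output above.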

In [ ]:
# Find rows where 'reviews' is NaN
empty_reviews_rows = games_df[games_df['reviews'].isna()]

# Print the empty reviews rows
empty_reviews_rows
Out[ ]:
id name date rating reviews plays playing backlogs wishlists description year release_year
136540 1136541 Enigma of Fear 2024-03-31 2.3 NaN 21.0 0.0 42.0 230.0 Become Mia, a paranormal detective searching f... 2024.0 2024.0

This is understandable again, however I am interested to see what the case is with games that are announced for after 2024

In [ ]:
# Create a subset of games released after 2024
games_after_2024 = games_df[games_df['release_year'] > 2024]

# Show a random sample of 15 rows from the subset
games_after_2024.sample(15)
Out[ ]:
id name date rating reviews plays playing backlogs wishlists description year release_year
137685 1137686 Tales of the Death 2025-12-31 NaN 0.0 0.0 0.0 0.0 7.0 Join a comic adventure in a hand-drawn book wi... 2025.0 2025.0
137689 1137690 DecaPolice 2025-12-31 NaN 0.0 5.0 1.0 127.0 575.0 DecaPolice, a crime-suspense RPG from Level-5,... 2025.0 2025.0
137709 1137710 Margaritari 2026-01-01 NaN 0.0 0.0 0.0 4.0 12.0 "Margaritari", a JRPG-style narrative game. Co... 2026.0 2026.0
137674 1137675 Grand Theft Auto VI 2025-12-31 NaN 16.0 83.0 6.0 557.0 2452.0 Grand Theft Auto VI heads to the state of Leon... 2025.0 2025.0
137715 1137716 PilotXross 2030-12-20 NaN 0.0 2.0 0.0 2.0 3.0 VR flight game developed for VR devices.The pl... 2030.0 2030.0
137701 1137702 Eternity Guards 2025-12-31 NaN 0.0 0.0 0.0 0.0 1.0 The main idea of the game is a fantasy alterna... 2025.0 2025.0
137680 1137681 Mouse 2025-12-31 NaN 1.0 0.0 0.0 87.0 371.0 Join private detective John Mouston in MOUSE, ... 2025.0 2025.0
137683 1137684 Rise of Rebellion 2025-12-31 NaN 0.0 0.0 0.0 0.0 4.0 "Rise of Rebellion" is a 3D action RPG that pu... 2025.0 2025.0
137684 1137685 Twistales 2025-12-31 NaN 0.0 0.0 0.0 1.0 11.0 On the day of what she believed to be her ‘Hap... 2025.0 2025.0
137669 1137670 Monster Hunter Wilds 2025-12-31 NaN 3.0 5.0 1.0 155.0 646.0 Monster Hunter Wilds. The next generation in t... 2025.0 2025.0
137668 1137669 Silent Planet 2025-12-31 NaN 0.0 0.0 0.0 2.0 2.0 Silent Planet is an exploration-focused, dark ... 2025.0 2025.0
137670 1137671 Kipidon 2025-12-31 NaN 0.0 1.0 0.0 1.0 5.0 A colorful shooter that uses a flower as a cup... 2025.0 2025.0
137655 1137656 Dungellion 2025-04-01 NaN 0.0 0.0 0.0 0.0 0.0 Rogue-lite with elements of Battle Royale and ... 2025.0 2025.0
137677 1137678 Riversiders 2025-12-31 NaN 0.0 1.0 0.0 0.0 0.0 Raft down the river, camp in the wild, and mak... 2025.0 2025.0
137699 1137700 Big Boss: A Villain Simulator 2025-12-31 NaN 0.0 0.0 0.0 0.0 0.0 An asymmetrical roguelite experience where the... 2025.0 2025.0

After looking more into it and into what is going on on the site, it turns out that there are games with reviews although they still have not been released. In that case it does seem alright if we just substitute the reviews of games after 2024 with a 0

Another example of what I saw: image.png

However on the site, we are not shown any reviews - https://www.backloggd.com/games/2xko/

This is a problem that is happening to games that have not been released yet, so I do believe that it will be alright if we put the reviews at 0 for the games after 2024
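
The rule described here, setting reviews to 0 for every game dated after 2024, could be sketched as follows. The frame is toy data with invented names and counts. Note that this sketch overwrites existing review counts for unreleased games, which is a different operation from only filling in missing values.

```python
import numpy as np
import pandas as pd

# Toy frame: one released game, two future games, one game without a date.
toy = pd.DataFrame({
    "name": ["Released", "Upcoming A", "Upcoming B", "No date"],
    "date": pd.to_datetime(["2021-08-27", "2025-12-31", "2026-01-01", None]),
    "reviews": [440.0, 16.0, np.nan, 3.0],
})

# Games without a date compare as False here and are left untouched.
unreleased = toy["date"].dt.year > 2024

# Overwrite the review count for unreleased games only.
toy.loc[unreleased, "reviews"] = 0
print(toy)
```

Games with a missing date keep their review count in this version, so they would need a separate decision.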

In [ ]:
games_df.sample(15)
Out[ ]:
id name date rating reviews plays playing backlogs wishlists description year release_year
6130 1006131 Rambo 1988-12-04 1.9 6.0 96.0 0.0 20.0 7.0 Rambo is a side scrolling platform game where ... 1988.0 1988.0
94916 1094917 MMX: Otherworld Mystery - Expanded Edition 2019-10-17 NaN 0.0 0.0 0.0 1.0 0.0 The mystery of the Otherworldly World MMX will... 2019.0 2019.0
112163 1112164 No More Heroes III 2021-08-27 3.9 440.0 2569.0 140.0 2435.0 1857.0 The latest numbered entry in the No More Heroe... 2021.0 2021.0
8571 1008572 King's Bounty 1990-12-31 3.0 1.0 32.0 0.0 23.0 5.0 While King Maximus ruled the land, life was go... 1990.0 1990.0
145671 1145672 WSOP NaT NaN 0.0 1.0 0.0 0.0 0.0 Show off your Texas Hold'em Poker skills! The... NaN NaN
54334 1054335 Cities in Motion 2 2013-04-02 2.3 2.0 67.0 1.0 54.0 3.0 Cities in Motion 2 (CIM2) is the sequel to the... 2013.0 2013.0
74582 1074583 Vikings: Wolves of Midgard 2017-03-24 2.5 4.0 106.0 4.0 102.0 10.0 Vikings: Wolves of Midgard takes you to the Sh... 2017.0 2017.0
44345 1044346 Hikari no Valusia ~What a Beautiful Hopes~ 2009-11-20 NaN 0.0 2.0 0.0 6.0 8.0 In a big city in the desert called Valcia, onl... 2009.0 2009.0
155460 1155461 Kanojo wa Dare to demo Sex suru. NaT NaN 0.0 1.0 0.0 0.0 1.0 A visual novel from Orcsoft Team Goblin. NaN NaN
133864 1133865 Inescapable: No Rules, No Rescue 2023-10-19 NaN 1.0 5.0 0.0 8.0 17.0 Inescapable is a social thriller set in a trop... 2023.0 2023.0
150024 1150025 Salvage NaT NaN 0.0 0.0 0.0 2.0 0.0 Turn-based bullet-hell class warfare NaN NaN
56923 1056924 Detective Grimoire: Secret of the Swamp 2014-01-02 3.3 28.0 296.0 4.0 99.0 42.0 Solve puzzles, collect clues, explore the swam... 2014.0 2014.0
126182 1126183 Hitman 3: Dubai 2023-01-20 3.8 0.0 10.0 0.0 2.0 0.0 Experience the grandeur and decadence of Dubai... 2023.0 2023.0
121792 1121793 Madden NFL 23 2022-08-19 2.3 45.0 252.0 18.0 21.0 11.0 Play your way into the history books! Updates ... 2022.0 2022.0
107040 1107041 The Yellow Rose Motel 2021-02-22 NaN 0.0 0.0 0.0 0.0 0.0 The yellow rose motel is a ps1/VHS style low p... 2021.0 2021.0
In [ ]:
# Drop 'year' and 'release_year' columns
games_df = games_df.drop(['year', 'release_year'], axis=1)
In [ ]:
games_df.isna().sum()
Out[ ]:
id                  0
name                0
date            34781
rating         116942
reviews             1
plays             694
playing           694
backlogs          694
wishlists         694
description     18924
dtype: int64
In [ ]:
# Find rows where 'reviews' is NaN
empty_reviews_rows = games_df[games_df['reviews'].isna()]

# Print the empty reviews rows
empty_reviews_rows
Out[ ]:
id name date rating reviews plays playing backlogs wishlists description
136540 1136541 Enigma of Fear 2024-03-31 2.3 NaN 21.0 0.0 42.0 230.0 Become Mia, a paranormal detective searching f...
In [ ]:
games_df['reviews'] = games_df['reviews'].fillna(0)
print(games_df)
             id                                         name       date  \
0       1000001            Cathode Ray Tube Amusement Device 1947-12-31   
1       1000002                             Bertie the Brain 1950-08-25   
2       1000003                                          Nim 1951-12-31   
3       1000004                                     Draughts 1952-08-31   
4       1000005                                          OXO 1952-12-31   
...         ...                                          ...        ...   
172507  1172508  Super Robot Wars 30: Digital Deluxe Edition        NaT   
172508  1172509           Xotic: Temple Crypt Expansion Pack        NaT   
172509  1172510                                 Dust Raiders        NaT   
172510  1172511                                    EXE Clash        NaT   
172511  1172512      Dance Killer Trick!!!: Boys, Be Dancing        NaT   

        rating  reviews  plays  playing  backlogs  wishlists  \
0          3.5     65.0  117.0      1.0      28.0       56.0   
1          2.5     11.0   24.0      0.0       6.0       12.0   
2          1.8      2.0   11.0      0.0       2.0        6.0   
3          2.4      3.0   17.0      0.0       3.0        7.0   
4          3.1     14.0   52.0      1.0      12.0       13.0   
...        ...      ...    ...      ...       ...        ...   
172507     NaN      0.0    0.0      0.0       0.0        0.0   
172508     NaN      0.0    1.0      0.0       0.0        0.0   
172509     NaN      0.0    0.0      0.0       0.0        2.0   
172510     NaN      0.0    0.0      0.0       0.0        1.0   
172511     NaN      0.0    2.0      0.0       0.0        1.0   

                                              description  
0       The cathode ray tube amusement device is the e...  
1       Currently considered the first videogame in hi...  
2       The Nimrod was a special purpose computer that...  
3       A game of draughts (a.k.a. checkers) written f...  
4       OXO was a computer game developed by Alexander...  
...                                                   ...  
172507                                                NaN  
172508  Explore the mystical crypts and forgotten pass...  
172509  Dust Raiders is a management strategy game, se...  
172510  A platform fighting game featuring many spooky...  
172511  Dance Killer Trick!!!: Boys, Be Dancing is an ...  

[172511 rows x 10 columns]
In [ ]:
games_df.isna().sum()
Out[ ]:
id                  0
name                0
date            34781
rating         116942
reviews             0
plays             694
playing           694
backlogs          694
wishlists         694
description     18924
dtype: int64

Scatterplot with the genres - not encoded¶

In [ ]:
genres_df.head()
Out[ ]:
id genre
0 1000001 Point-and-Click
1 1000002 Puzzle, Tactical
2 1000003 Pinball, Strategy
3 1000004 Card & Board Game
4 1000005 Puzzle, Strategy
In [ ]:
# Select the columns to plot
cols1 = ['reviews', 'plays', 'playing', 'backlogs', 'wishlists', 'year']

subset_df = combined_df_no_encoding[cols1]

# Create a scatter plot matrix
sns.pairplot(subset_df)

plt.show()
[Figure: pair plot (scatter plot matrix) of reviews, plays, playing, backlogs, wishlists and year]
In [ ]:
print(games_df.isna().sum())
games_df.sample(15)
id                  0
name                0
date            34781
rating         116942
reviews             0
plays             694
playing           694
backlogs          694
wishlists         694
description     18924
dtype: int64
Out[ ]:
id name date rating reviews plays playing backlogs wishlists description
89230 1089231 VMod 2019-01-10 NaN 0.0 1.0 0.0 0.0 0.0 VMod is a puzzle game about side-effects. Tap ...
73562 1073563 Elaine 2017-02-14 NaN 0.0 0.0 0.0 1.0 0.0 This is elaine! A minimalistic tile-matching a...
61256 1061257 GrindCraft 2015-01-30 2.5 3.0 10.0 0.0 0.0 0.0 GrindCraft is a Minecraft-themed clicker game ...
115156 1115157 PsiloSybil 2021-12-07 3.2 5.0 34.0 3.0 48.0 98.0 An old-school, tough-as nails classic linear 3...
114581 1114582 Door 3: Insignia 2021-11-15 NaN 0.0 0.0 0.0 3.0 0.0 Symbols have never had such an impact on a per...
119409 1119410 Under the Jolly Roger: Complete Edition 2022-05-23 NaN 0.0 1.0 0.0 0.0 1.0 The Complete Edition includes: - Under the Jol...
88024 1088025 Daze 2018-11-27 NaN 0.0 0.0 0.0 0.0 0.0 The world you’re about to experience Crafted b...
90838 1090839 City of Heroes: Homecoming 2019-04-02 NaN 0.0 3.0 4.0 0.0 0.0 City of Heroes: Homecoming is an officially li...
129848 1129849 PonyGuessr 2023-05-17 NaN 1.0 3.0 0.0 0.0 1.0 Guess the episode each My Little Pony: Friends...
21081 1021082 Red Baron II 1998-10-30 NaN 0.0 5.0 0.0 3.0 0.0 Red Baron II by Dynamix, Inc is the sequel to ...
46288 1046289 Leo & Leah 2010-07-22 NaN 0.0 0.0 0.0 0.0 0.0 Rather simple. Leo and Leah have been close si...
23580 1023581 NASCAR Challenge 2000-01-01 2.8 1.0 5.0 0.0 2.0 0.0 The thrill of victory and the agony of smashin...
162822 1162823 Shadows of the Damned NaT NaN 0.0 1.0 0.0 0.0 0.0 Well after its 2011 Xbox 360 and PlayStation 3...
36824 1036825 The Mysterious Mine Bouncin' Back Edition 2006-12-21 NaN 0.0 0.0 0.0 1.0 1.0 This is the author’s Earthbound hack that they...
46254 1046255 duplicate Hakuoki Junsouroku 2010-07-17 NaN 0.0 2.0 0.0 2.0 0.0 Released in Japan in 2010.
In [ ]:
# Select rows where 'plays' is NaN
nan_plays = games_df[games_df['plays'].isna()]

# Print the selected rows
nan_plays.sample(15)
Out[ ]:
id name date rating reviews plays playing backlogs wishlists description
6591 1006592 Circus Games 1988-12-31 NaN 0.0 NaN NaN NaN NaN Ladiiiieeeees and gentlemeeeeennnnn! Children ...
74575 1074576 The Cat Games 2017-03-24 3.1 5.0 NaN NaN NaN NaN Do you like cats? Then this is the purrfect ga...
147678 1147679 Play the Games Vol. 5 NaT NaN 0.0 NaN NaN NaN NaN NaN
45924 1045925 Family Gameshow 2010-06-04 NaN 0.0 NaN NaN NaN NaN Good evening and welcome to Family Gameshow! T...
132066 1132067 Food Fighter Clicker Games 2023-08-22 NaN 0.0 NaN NaN NaN NaN Food Fighter Clicker is a clicker game to beco...
34259 1034260 2 Games in 1: Sonic Pinball Party + Sonic Battle 2005-11-11 NaN 0.0 NaN NaN NaN NaN Bundle containing Sonic Pinball Party and Soni...
24633 1024634 Barbie: Fashion Pack Games 2000-10-01 NaN 0.0 NaN NaN NaN NaN Discover a world of fashion fun with Barbie! P...
145376 1145377 Game Chest: Board Games NaT NaN 0.0 NaN NaN NaN NaN NaN
11504 1011505 California Games II 1992-12-31 NaN 0.0 NaN NaN NaN NaN Amiga port of California Games II.
79422 1079423 Red 2017-11-01 NaN 0.0 NaN NaN NaN NaN Try your best in the hardest game ever release...
159061 1159062 Pack 2 Games Pony Friends 2 + My Riding Stable... NaT NaN 0.0 NaN NaN NaN NaN NaN
11783 1011784 Summer Games 1992-12-31 NaN 0.0 NaN NaN NaN NaN NaN
11686 1011687 California Games II 1992-12-31 NaN 0.0 NaN NaN NaN NaN Atari ST port of California Games II.
102175 1102176 The Dreamcatcher 2020-08-21 NaN 1.0 NaN NaN NaN NaN There is a saying that "what you think about i...
58506 1058507 Escape from Jay Is Games 2014-06-06 NaN 0.0 NaN NaN NaN NaN It's a room escape point and click puzzle game...

After having a look at samples of the games that don't have any plays, it does seem that the missing plays are heavily connected with missing playing, backlogs and wishlists. There are quite a few examples where we have neither the release date nor any of playing, backlogs, wishlists and plays, and nothing when it comes to the description either.

Examples:

image-3.png image-2.png

I looked them up on the site to see whether we might be able to find the missing values there, but it does not look like it

https://www.backloggd.com/games/play-the-games-vol-5/

image.png

After finding this, I do believe that it will be alright if we drop the rows that have neither a date nor anything else. They are pretty much empty rows with nothing but a title and an id, and there is no reason for us to have a recommendation system that does not give anything of actual value to the user.
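Before dropping anything, the idea that plays, playing, backlogs and wishlists are always missing together can be checked directly. This is a small sketch on toy data (the values are invented). On the real data the same two counts would be computed on games_df.

```python
import numpy as np
import pandas as pd

# Toy stand-in for games_df with the four engagement columns.
toy = pd.DataFrame({
    "plays": [5.0, np.nan, np.nan],
    "playing": [0.0, np.nan, np.nan],
    "backlogs": [2.0, np.nan, np.nan],
    "wishlists": [1.0, np.nan, np.nan],
})

engagement_cols = ["plays", "playing", "backlogs", "wishlists"]
missing = toy[engagement_cols].isna()

# Rows where at least one of the four columns is missing...
rows_any_missing = int(missing.any(axis=1).sum())
# ...and rows where all four are missing at once.
rows_all_missing = int(missing.all(axis=1).sum())

# If both counts are equal, the four columns are always missing together.
print(rows_any_missing, rows_all_missing)
```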

In [ ]:
print(nan_plays.shape)

nan_plays.isna().sum()
(694, 10)
Out[ ]:
id               0
name             0
date           158
rating         530
reviews          0
plays          694
playing        694
backlogs       694
wishlists      694
description    172
dtype: int64
In [ ]:
# Reassign instead of using inplace=True on a slice of games_df (avoids SettingWithCopyWarning)
nan_plays = nan_plays.dropna(subset=['date', 'plays'], how='all')

nan_plays.isna().sum()
Out[ ]:
id               0
name             0
date             0
rating         386
reviews          0
plays          536
playing        536
backlogs       536
wishlists      536
description     67
dtype: int64
In [ ]:
nan_plays.sample(15)
Out[ ]:
id name date rating reviews plays playing backlogs wishlists description
6855 1006856 California Games 1989-05-01 2.8 0.0 NaN NaN NaN NaN Master System port of California Games.
4617 1004618 Future Games 1986-12-31 NaN 0.0 NaN NaN NaN NaN NaN
103258 1103259 Encore Classic Casino Games 2020-10-05 NaN 0.0 NaN NaN NaN NaN You've hit the jackpot with the most comprehen...
51498 1051499 50 Classic Games 3D 2012-04-24 NaN 1.0 NaN NaN NaN NaN Now your favorite classic pastimes are bundled...
107691 1107692 Bocchi Kaihi 2021-03-17 NaN 0.0 NaN NaN NaN NaN Rescue the mysterious main character with a sh...
42138 1042139 Best of Arcade Games DS 2009-01-28 NaN 0.0 NaN NaN NaN NaN NaN
15177 1015178 Family Games 1995-02-01 NaN 0.0 NaN NaN NaN NaN Here's a game collection for the whole family....
4403 1004404 Future Games 1986-12-31 NaN 0.0 NaN NaN NaN NaN NaN
17841 1017842 3DO Games: Decathlon 1996-10-31 NaN 0.0 NaN NaN NaN NaN As you would expect, all ten events are repres...
27691 1027692 Love Game's Wai Wai Tennis Plus 2002-04-28 NaN 0.0 NaN NaN NaN NaN Tune up for "Wai Wai Tennis", the authentic po...
34217 1034218 Clubhouse Games 2005-11-03 3.5 28.0 NaN NaN NaN NaN It's game night and everyone's invited. Play m...
3594 1003595 Rainy Day Games 1985-12-31 NaN 0.0 NaN NaN NaN NaN Rainy Day Games is 1 - 3 player series of 3 ca...
42283 1042284 Clubhouse Games Express: Strategy Pack 2009-02-25 NaN 0.0 NaN NaN NaN NaN A DSiWare game based on the original Clubhouse...
101651 1101652 Rage of Car Force: Car Crashing Games 2020-07-30 NaN 0.0 NaN NaN NaN NaN Rage of Car Force is a team multiplayer PvP ga...
18670 1018671 Love Game's Wai Wai Tennis 1997-02-28 NaN 0.0 NaN NaN NaN NaN NaN

image-4.png image.png image-2.png image-3.png image-5.png image-6.png

After looking into more examples of games that do not have any values in the columns we are currently interested in, it would seem that it comes down to one of 2 things:

1. It is some sort of a bundle of games that we do not have any details about
2. The games actually do not have that much data for them, so they are pretty much 'left out' and have not been interacted with

image.png

Another example:

image-2.png

After considering all of this, I do believe that it will be fine to just drop all of the rows that have missing values in plays, playing, backlogs and wishlists

In [ ]:
games_df.sample(15)
Out[ ]:
id name date rating reviews plays playing backlogs wishlists description
169786 1169787 Super Robot Taisen Compact 3 NaT NaN 0.0 3.0 0.0 6.0 4.0 Super Robot Taisen Compact 3 is the final Wond...
100074 1100075 Sonic 3D in 2D 2020-05-20 2.5 7.0 27.0 5.0 14.0 6.0 Sonic 3D in 2D is a fangame that reimagines So...
110701 1110702 Labyrinths of the World: The Game of Minds - C... 2021-07-02 NaN 0.0 0.0 0.0 0.0 2.0 Desperate Times Call for Desperate Measures...
144580 1144581 Rivals at War: Firefight NaT NaN 0.0 2.0 0.0 0.0 0.0 Rivals at War: Firefight is a military themed,...
72497 1072498 Halodoom: Code of Silence 2016-12-31 NaN 0.0 3.0 0.0 6.0 4.0 Halodoom is a project about the past and the p...
69030 1069031 Robot Legions Reborn 2016-07-19 NaN 0.0 3.0 0.0 0.0 0.0 On a planet overrun by robots, one rogue unit ...
64785 1064786 Sword Coast Legends 2015-10-20 1.7 4.0 33.0 1.0 22.0 3.0 Set in the lush and vibrant world of the Forgo...
48990 1048991 Wii Play: Motion 2011-06-13 3.1 29.0 418.0 3.0 59.0 49.0 Wii Play: Motion is a minigame collection that...
44854 1044855 Situation Outbreak 2009-12-31 NaN 0.0 0.0 0.0 0.0 0.0 Situation Outbreak is an Orange Box mod about ...
130056 1130057 Sunshine Shuffle 2023-05-24 3.0 14.0 32.0 0.0 14.0 28.0 Sunshine Shuffle is a narrative poker adventur...
4970 1004971 Hysteria 1987-08-01 NaN 0.0 2.0 0.0 3.0 0.0 A fanatical sect has changed mankind's future ...
40905 1040906 Family Trainer 2008-09-26 2.8 2.0 19.0 0.0 2.0 0.0 An outdoor sports themed mini-game collection ...
104103 1104104 Deviant Anomalies 2020-10-31 NaN 0.0 1.0 0.0 1.0 0.0 Deviant Anomalies is an adult visual novel and...
127095 1127096 Ages of Conflict: World War Simulator 2023-02-17 2.6 1.0 18.0 0.0 4.0 1.0 Ages of Conflict is a versatile Map Simulation...
44936 1044937 Xuan-Yuan Sword: The Clouds Faraway 2010-01-12 NaN 0.0 1.0 0.0 1.0 1.0 軒轅劍外傳: 雲之遙 (Chinese) The Clouds Faraway, also...
In [ ]:
games_df.isna().sum()
Out[ ]:
id                  0
name                0
date            34781
rating         116942
reviews             0
plays             694
playing           694
backlogs          694
wishlists         694
description     18924
dtype: int64
In [ ]:
# Drop rows where 'plays', 'playing', 'backlogs', or 'wishlists' are NaN
games_df = games_df.dropna(subset=['plays', 'playing', 'backlogs', 'wishlists'])

# Print the DataFrame to verify the changes
games_df.sample(15)
Out[ ]:
id name date rating reviews plays playing backlogs wishlists description
72032 1072033 Defense Zone 3 Ultra HD 2016-12-14 NaN 0.0 5.0 0.0 0.0 0.0 The sequel to the hit strategy game, with new ...
138896 1138897 Get Backers Dakkanoku: Ubawareta Mugen Shiro NaT NaN 0.0 0.0 0.0 2.0 0.0 NaN
118864 1118865 SuperDungeon MegaCorp 2022-05-02 NaN 0.0 0.0 0.0 2.0 0.0 Welcome to SuperDungeon MegaCorp, a puzzle gam...
41982 1041983 Gauntlet 2008-12-31 NaN 2.0 4.0 0.0 4.0 1.0 Gauntlet DS is the cancelled chapter of the po...
41062 1041063 Crazy Mouse 2008-10-15 NaN 1.0 1.0 0.0 0.0 0.0 NaN
62826 1062827 My Bewitching Perfume 2015-06-01 NaN 0.0 1.0 0.0 6.0 4.0 An exciting love simulation game for girls! It...
14574 1014575 FIFA International Soccer 1994-12-31 NaN 0.0 1.0 0.0 2.0 0.0 8-bit port of FIFA International Soccer.
14906 1014907 Seal of the Pharaoh 1994-12-31 NaN 0.0 1.0 0.0 11.0 4.0 Seal of the Pharaoh is a mix of RPG, puzzle-so...
34630 1034631 AFL Premiership 2005 2005-12-31 NaN 1.0 7.0 0.0 0.0 2.0 AFL Premiership 2005 is based off the Australi...
110707 1110708 MontanaBlack Kylo's Rescue 2021-07-02 NaN 0.0 1.0 0.0 0.0 0.0 "You're kidnapping Kylo just because I don't w...
117577 1117578 WWE 2K22 2022-03-11 3.3 84.0 875.0 52.0 65.0 48.0 WWE 2K returns with all the features you can h...
42735 1042736 Burnout Paradise: Cops and Robbers 2009-04-30 3.3 1.0 8.0 0.0 1.0 3.0 The Cops and Robbers Pack is a downloadable pr...
87877 1087878 Fruit Salad 2018-11-18 NaN 0.0 1.0 0.0 2.0 0.0 Complevel 9. Maps 01-06. The palette will chan...
24626 1024627 EZ2Dancer 2000-09-30 NaN 0.0 4.0 0.0 0.0 0.0 EZ2Dancer is a series of dance video games dev...
25171 1025172 Beatmania III: Append Core Remix 2000-12-31 4.3 0.0 4.0 0.0 0.0 0.0 Beatmania III: Append Core Remix is a rhythm g...
In [ ]:
games_df.isna().sum()
Out[ ]:
id                  0
name                0
date            34623
rating         116412
reviews             0
plays               0
playing             0
backlogs            0
wishlists           0
description     18752
dtype: int64
In [ ]:
scores_df.head(15)
Out[ ]:
id score amount
0 1000001 0.5 10
1 1000001 1.0 5
2 1000001 1.5 1
3 1000001 2.0 3
4 1000001 2.5 9
5 1000001 3.0 10
6 1000001 3.5 2
7 1000001 4.0 2
8 1000001 4.5 3
9 1000001 5.0 41
10 1000002 0.5 0
11 1000002 1.0 3
12 1000002 1.5 0
13 1000002 2.0 4
14 1000002 2.5 2

Let us use a mathematical way of filling in the gaps in the rating column, with the help of the scores dataset¶

Things that should be considered: while looking into the ratings I found out that a game might have a NaN rating simply because it was not given one. However, there is also the chance that the game was given a rating on the site, but because it is only one "floating" rating it was not used as the average rating (which is understandable: an average of 1 rating is a bit weird to take as a good, all-round rating for a game).

Example of what I explained above ↑

image.png

1. We will look into games_df and see which rows have NaN "rating"
2. Find the rows in scores_df with the corresponding game id and the scores it was given
3. Calculate the average rating of the game with a certain id
4. Fill in the NaN value in games_df with the average rating that was just calculated
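Step 3 is a weighted mean. Writing $s_k$ for each possible score and $a_k$ for how many users gave it (the `score` and `amount` columns of scores_df; the symbols are just my own notation), the average rating of one game is:

$$\bar{r} = \frac{\sum_k s_k \, a_k}{\sum_k a_k}$$

As a check with the rows of id 1000001 shown in scores_df.head(15): the weighted sum is $303.5$ and the total amount is $86$, so $\bar{r} = 303.5 / 86 \approx 3.53$, which rounds to the 3.5 that games_df already lists as the rating of that game.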
In [ ]:
# Identify the rows in games_df where 'rating' is NaN
nan_rating_ids = games_df[games_df['rating'].isna()]['id']

# For each of these rows, find the corresponding rows in scores_df
# (.copy() so that adding a column does not raise SettingWithCopyWarning)
scores_df_filtered = scores_df[scores_df['id'].isin(nan_rating_ids)].copy()

# Calculate the weighted average rating for each game
# (games whose amounts sum to 0 end up with NaN, since 0 / 0 is NaN)
scores_df_filtered['total_score'] = scores_df_filtered['score'] * scores_df_filtered['amount']
grouped_scores = scores_df_filtered.groupby('id').agg({'total_score': 'sum', 'amount': 'sum'}).reset_index()
grouped_scores['average_rating'] = (grouped_scores['total_score'] / grouped_scores['amount']).round(1)

# Replace the NaN values in 'rating' in games_df with the calculated average rating
# (assign the result back: chained inplace fillna raises a FutureWarning and stops working in pandas 3.0)
games_df = games_df.set_index('id')
grouped_scores = grouped_scores.set_index('id')
games_df['rating'] = games_df['rating'].fillna(grouped_scores['average_rating'])
games_df = games_df.reset_index()
In [ ]:
print(games_df.isna().sum())
games_df.sample(15)
id                 0
name               0
date           34623
rating         79226
reviews            0
plays              0
playing            0
backlogs           0
wishlists          0
description    18752
dtype: int64
Out[ ]:
id name date rating reviews plays playing backlogs wishlists description
132186 1132712 Yummy Jewels 2023-09-14 NaN 0.0 0.0 0.0 0.0 0.0 Go on a candy puzzle adventure.
64917 1065262 Demented 2015-11-18 NaN 0.0 1.0 0.0 0.0 0.0 Enter the crooked world of Demented, where eve...
10875 1010941 Jennifer Capriati Tennis 1992-09-16 3.5 0.0 1.0 0.0 0.0 0.0 This is a tennis game featuring the game modes...
22233 1022356 Shiritsu Justice Gakuen: Nekketsu Seishun Nikki 2 1999-06-24 4.0 3.0 33.0 0.0 9.0 10.0 This is anPlayStation-exclusive update to the ...
140350 1140898 Sonic Classics: 3-in-1 NaT 2.7 0.0 7.0 1.0 2.0 1.0 A compilation cartridge containing 3 Sonic the...
72328 1072696 Variant: Limits 2017-01-09 NaN 0.0 0.0 0.0 0.0 0.0 Variant: Limits connects mathematics and game ...
168951 1169638 BoBoiBoy: Adudu Attacks! NaT NaN 0.0 0.0 0.0 0.0 0.0 NaN
92181 1092603 O2Jam 2019-06-30 NaN 0.0 1.0 0.0 0.0 0.0 Enjoy the classic rhythm game for everyone! En...
136346 1136883 Terminator: Survivors 2024-10-24 NaN 0.0 0.0 0.0 0.0 2.0 Play as a survivor in an open world set after ...
59861 1060183 LEGO Batman Trilogy 2014-11-11 3.8 5.0 170.0 1.0 35.0 6.0 NaN
86758 1087166 Fly High 2018-10-19 NaN 0.0 0.0 0.0 0.0 0.0 Fly High is action-adventure game with fantasy...
37130 1037336 Cradle of Rome 2007-02-27 3.0 1.0 14.0 0.0 0.0 1.0 Build the heart of the Ancient Roman Empire an...
7197 1007251 Clown-o-Mania 1989-12-31 NaN 0.0 1.0 0.0 1.0 0.0 An obscure arcade-style game for the Amiga and...
79799 1080185 Cleaning house 2017-12-04 2.9 0.0 6.0 1.0 2.0 2.0 a little game about a house, and what is left ...
66310 1066660 The Gate of Firmament 2016-02-25 NaN 0.0 1.0 0.0 3.0 2.0 The “Xuan-Yuan Sword” is an epic oriental RPG ...

Previously we had 116412 missing values in the rating column. Now we have 79226. One way or another we are going in the right direction, but I am still not sure this is enough. It is possible that the remaining values are missing because those games did not have a single score given to them

example: image.png

what is in the scores_df when it comes to this game's ratings:

id,score,amount
1130982,0.5,0
1130982,1.0,0
1130982,1.5,0
1130982,2.0,0
1130982,2.5,0
1130982,3.0,0
1130982,3.5,0
1130982,4.0,0
1130982,4.5,0
1130982,5.0,0
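The all-zero listing above is exactly the case where the weighted average cannot be computed: the total score and the total amount are both 0, and 0 / 0 gives NaN in pandas. A small sketch on toy data (ids and values are invented) shows both cases next to each other:

```python
import pandas as pd

# Toy stand-in for scores_df: game 1 has votes, game 2 has none at all.
toy_scores = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "score": [3.0, 4.0, 3.0, 4.0],
    "amount": [1, 1, 0, 0],
})

# Same weighted-average logic as in the notebook cell.
toy_scores["total_score"] = toy_scores["score"] * toy_scores["amount"]
grouped = toy_scores.groupby("id")[["total_score", "amount"]].sum()
grouped["average_rating"] = (grouped["total_score"] / grouped["amount"]).round(1)

# Game 1 gets a rating, game 2 stays NaN because nobody scored it.
print(grouped)
```

So the ratings that are still missing after the fill are not necessarily an error in the calculation. For those games there is simply nothing to average.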
In [ ]:
new_combined_games = games_df.merge(genres_df_encoded, on='id', how='inner')
In [ ]:
new_combined_games.head(15)
Out[ ]:
id name date rating reviews plays playing backlogs wishlists description ... RPG Racing Real Time Strategy Shooter Simulator Sport Strategy Tactical Turn Based Strategy Visual Novel
0 1000001 Cathode Ray Tube Amusement Device 1947-12-31 3.5 65.0 117.0 1.0 28.0 56.0 The cathode ray tube amusement device is the e... ... 0 0 0 0 0 0 0 0 0 0
1 1000002 Bertie the Brain 1950-08-25 2.5 11.0 24.0 0.0 6.0 12.0 Currently considered the first videogame in hi... ... 0 0 0 0 0 0 0 1 0 0
2 1000003 Nim 1951-12-31 1.8 2.0 11.0 0.0 2.0 6.0 The Nimrod was a special purpose computer that... ... 0 0 0 0 0 0 1 0 0 0
3 1000004 Draughts 1952-08-31 2.4 3.0 17.0 0.0 3.0 7.0 A game of draughts (a.k.a. checkers) written f... ... 0 0 0 0 0 0 0 0 0 0
4 1000005 OXO 1952-12-31 3.1 14.0 52.0 1.0 12.0 13.0 OXO was a computer game developed by Alexander... ... 0 0 0 0 0 0 1 0 0 0
5 1000006 Pool 1954-06-26 3.0 5.0 20.0 0.0 2.0 4.0 A game of pool (billiards) developed by Willia... ... 0 0 0 0 0 1 0 0 0 0
6 1000007 Tennis for Two 1958-10-18 3.0 41.0 100.0 0.0 18.0 29.0 Tennis for Two is often credited to be the wor... ... 0 0 0 0 0 1 0 0 0 0
7 1000008 Mouse in the Maze 1959-01-16 2.6 3.0 17.0 0.0 2.0 6.0 A game where players place maze walls, bits of... ... 0 0 0 0 1 0 0 0 0 0
8 1000009 Spacewar! 1962-04-30 3.0 25.0 124.0 0.0 23.0 36.0 Spacewar! is one of the earliest digital compu... ... 0 0 0 1 1 0 0 0 0 0
9 1000010 The Sumerian Game 1964-12-31 2.6 3.0 17.0 0.0 7.0 7.0 The Sumerian Game is a text-based strategy vid... ... 0 0 0 0 0 0 1 0 0 0
10 1000011 Periscope 1965-12-31 2.1 2.0 20.0 0.0 7.0 11.0 The electro-mechanical game was released in th... ... 0 0 0 1 0 0 0 0 0 0
11 1000012 Hamurabi 1968-12-31 2.4 14.0 46.0 0.0 4.0 11.0 Hamurabi is a text-based game of land and reso... ... 0 0 0 0 0 0 1 0 0 0
12 1000013 Civil War 1968-12-31 1.8 3.0 14.0 0.0 2.0 2.0 A turn-based, strategic simulation of fourteen... ... 0 0 0 0 0 0 1 0 1 0
13 1000014 Indy 500 1969-03-31 1.0 0.0 6.0 0.0 5.0 3.0 A first-person arcade racing game released by ... ... 0 1 0 0 0 0 0 0 0 0
14 1000015 Yakyuuken 1969-04-27 NaN 1.0 3.0 0.0 3.0 2.0 One of the very first erotic video games ever ... ... 0 0 0 0 0 0 0 0 0 0

15 rows × 33 columns

In [ ]:
column_names = new_combined_games.columns
column_names
Out[ ]:
Index(['id', 'name', 'date', 'rating', 'reviews', 'plays', 'playing',
       'backlogs', 'wishlists', 'description', 'Adventure', 'Arcade',
       'Brawler', 'Card & Board Game', 'Fighting', 'Indie', 'MOBA', 'Music',
       'Pinball', 'Platform', 'Point-and-Click', 'Puzzle', 'Quiz/Trivia',
       'RPG', 'Racing', 'Real Time Strategy', 'Shooter', 'Simulator', 'Sport',
       'Strategy', 'Tactical', 'Turn Based Strategy', 'Visual Novel'],
      dtype='object')
In [ ]:
new_combined_games.isna().sum()
Out[ ]:
id                         0
name                       0
date                   22013
rating                 63288
reviews                    0
plays                      0
playing                    0
backlogs                   0
wishlists                  0
description             7694
Adventure                  0
Arcade                     0
Brawler                    0
Card & Board Game          0
Fighting                   0
Indie                      0
MOBA                       0
Music                      0
Pinball                    0
Platform                   0
Point-and-Click            0
Puzzle                     0
Quiz/Trivia                0
RPG                        0
Racing                     0
Real Time Strategy         0
Shooter                    0
Simulator                  0
Sport                      0
Strategy                   0
Tactical                   0
Turn Based Strategy        0
Visual Novel               0
dtype: int64

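It does seem like the number of missing ratings dropped due to the merging. The inner merge only keeps the games whose id also appears in genres_df_encoded, so the rows without a genre entry were removed, together with their missing values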

old results:

id                 0
name               0
date           34623
rating         79226
reviews            0
plays              0
playing            0
backlogs           0
wishlists          0
description    18752

new results:

id                         0
name                       0
date                   22013
rating                 63288
reviews                    0
plays                      0
playing                    0
backlogs                   0
wishlists                  0
description             7694
Adventure                  0
Arcade                     0
Brawler                    0
Card & Board Game          0
Fighting                   0
Indie                      0
MOBA                       0
Music                      0
Pinball                    0
Platform                   0
Point-and-Click            0
Puzzle                     0
Quiz/Trivia                0
RPG                        0
Racing                     0
Real Time Strategy         0
Shooter                    0
Simulator                  0
Sport                      0
Strategy                   0
Tactical                   0
Turn Based Strategy        0
Visual Novel               0
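To see which games the inner merge leaves out, a left merge with `indicator=True` can be used next to it. This is only a sketch on toy data (ids, ratings and the single genre column are invented), not the result for the real dataset.

```python
import pandas as pd

# Toy stand-ins for games_df and genres_df_encoded.
toy_games = pd.DataFrame({"id": [1, 2, 3], "rating": [3.5, None, None]})
toy_genres = pd.DataFrame({"id": [1, 2], "Puzzle": [1, 0]})

# Inner merge: game 3 disappears because it has no genre row.
inner = toy_games.merge(toy_genres, on="id", how="inner")

# Left merge with an indicator column to list the games that would be dropped.
checked = toy_games.merge(toy_genres, on="id", how="left", indicator=True)
without_genres = checked[checked["_merge"] == "left_only"]

print(len(toy_games), len(inner))                                   # rows before / after
print(toy_games["rating"].isna().sum(), inner["rating"].isna().sum())  # missing ratings before / after
print(without_genres["id"].tolist())                                # ids without a genre row
```

The count of missing ratings goes down only because a row was removed, not because a rating was filled in.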
In [ ]:
new_combined_games.shape
Out[ ]:
(146872, 33)

Let's try some modeling¶

For this we are going to be using Jaccard similarity. It measures, on a scale between 0 and 1, how much 2 sets of data overlap: the size of their intersection divided by the size of their union. The closer to 1 the result is, the more similar they are to each other

https://medium.com/@mayurdhvajsinhjadeja/jaccard-similarity-34e2c15fb524
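A tiny worked example of the measure, on two made-up games with four genre flags each. The genre flags are invented for illustration. `scipy.spatial.distance.jaccard` returns the distance, so the similarity is 1 minus that value.

```python
import numpy as np
from scipy.spatial.distance import jaccard

# Two invented games, one boolean flag per genre.
game_a = np.array([1, 1, 0, 0], dtype=bool)
game_b = np.array([1, 0, 1, 0], dtype=bool)

# Manual calculation: shared genres divided by genres that appear in either game.
intersection = np.logical_and(game_a, game_b).sum()  # 1 shared genre
union = np.logical_or(game_a, game_b).sum()          # 3 genres in total
manual_similarity = intersection / union

# The same value through SciPy's Jaccard distance.
scipy_similarity = 1 - jaccard(game_a, game_b)

print(manual_similarity, scipy_similarity)
```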

In [ ]:
# List of genre columns
genre_columns = [
    'Adventure', 'Arcade', 'Brawler', 'Card & Board Game', 'Fighting', 'Indie', 'MOBA', 
    'Music', 'Pinball', 'Platform', 'Point-and-Click', 'Puzzle', 'Quiz/Trivia', 'RPG', 
    'Racing', 'Real Time Strategy', 'Shooter', 'Simulator', 'Sport', 'Strategy', 'Tactical', 
    'Turn Based Strategy', 'Visual Novel'
]

# Create a list of columns to keep
columns_to_keep = ['name'] + genre_columns

# Create a copy of the DataFrame with only these columns
jaccard_df = new_combined_games[columns_to_keep].copy()
In [ ]:
jaccard_df.head(15)
Out[ ]:
name Adventure Arcade Brawler Card & Board Game Fighting Indie MOBA Music Pinball ... RPG Racing Real Time Strategy Shooter Simulator Sport Strategy Tactical Turn Based Strategy Visual Novel
0 Cathode Ray Tube Amusement Device 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 Bertie the Brain 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
2 Nim 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 1 0 0 0
3 Draughts 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 OXO 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
5 Pool 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
6 Tennis for Two 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
7 Mouse in the Maze 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
8 Spacewar! 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 1 0 0 0 0 0
9 The Sumerian Game 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
10 Periscope 0 1 0 0 0 0 0 0 1 ... 0 0 0 1 0 0 0 0 0 0
11 Hamurabi 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
12 Civil War 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
13 Indy 500 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
14 Yakyuuken 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

15 rows × 24 columns

In [ ]:
from sklearn.metrics import jaccard_score
from scipy.spatial.distance import pdist, squareform
In [ ]:
# Exclude the 'name' column when calculating the Jaccard distance
from sklearnex import patch_sklearn
patch_sklearn()
jaccard_distance = pdist(jaccard_df.drop(columns='name').values, metric='jaccard')
print(jaccard_distance)
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Cell In[91], line 4
      2 from sklearnex import patch_sklearn
      3 patch_sklearn()
----> 4 jaccard_distance = pdist(jaccard_df.drop(columns='name').values, metric='jaccard')
      5 print(jaccard_distance)

File c:\Users\kenar\AppData\Local\Programs\Python\Python311\Lib\site-packages\scipy\spatial\distance.py:2232, in pdist(X, metric, out, **kwargs)
   2230 if metric_info is not None:
   2231     pdist_fn = metric_info.pdist_func
-> 2232     return pdist_fn(X, out=out, **kwargs)
   2233 elif mstr.startswith("test_"):
   2234     metric_info = _TEST_METRICS.get(mstr, None)

MemoryError: Unable to allocate 80.4 GiB for an array with shape (10785618756,) and data type float64
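The error comes from the size of the result: pdist stores one distance per pair of games, and 146872 games give 146872 * 146871 / 2 = 10785618756 pairs, which at 8 bytes each is the 80.4 GiB from the message. One possible way around it, sketched below on toy data, is to compare only the selected game against all the others instead of building the full matrix. The helper names `jaccard_to_all` and `top_similar` are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for jaccard_df; names and genre flags are invented.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "Puzzle": [1, 1, 0, 0],
    "Strategy": [1, 0, 0, 1],
    "Shooter": [0, 0, 1, 0],
})
genre_matrix = toy.drop(columns="name").to_numpy(dtype=bool)


def jaccard_to_all(matrix, row_position):
    """Jaccard similarity between one row and every row of a boolean matrix."""
    target = matrix[row_position]
    intersection = np.logical_and(matrix, target).sum(axis=1)
    union = np.logical_or(matrix, target).sum(axis=1)
    # Assumption: pairs with an empty union (no genres on either side) get 0.
    return np.divide(intersection, union,
                     out=np.zeros(len(matrix)), where=union > 0)


def top_similar(df, matrix, name, n=5):
    """Return the n most similar games to `name` (first match if duplicated)."""
    row_position = np.flatnonzero(df["name"].to_numpy() == name)[0]
    scores = pd.Series(jaccard_to_all(matrix, row_position), index=df["name"])
    # Dropping by label removes every game that shares the selected name.
    return scores.drop(name).sort_values(ascending=False).head(n)


print(top_similar(toy, genre_matrix, "A", n=3))
```

This needs one pass over the genre matrix per query, so the memory use stays at the size of the matrix itself plus one score per game.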
In [ ]:
square_jaccard_distances = squareform(jaccard_distance)
In [ ]:
print(square_jaccard_distances)
[[0. 1.]
 [1. 0.]]

Since we used pdist, what we calculated is the distance, so it actually shows how different the games are. As we want the opposite, we will reverse it by subtracting the distances from 1
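On a small array the whole chain (pdist, squareform, then 1 minus the distance) can be followed by hand. The three toy rows below are invented and only illustrate the shapes and values involved.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three invented games with three genre flags each.
toy_genres = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
], dtype=bool)

# Condensed form: one distance per pair, n * (n - 1) / 2 = 3 values.
condensed = pdist(toy_genres, metric="jaccard")

# Square form: full 3 x 3 matrix with zeros on the diagonal.
square = squareform(condensed)

# Similarity is the reverse of the distance.
similarity = 1 - square

print(similarity)
```

Every game has similarity 1 with itself, games 0 and 1 share one of two genres (0.5), and game 2 has nothing in common with the others (0).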

In [ ]:
jaccard_similiarity_array = 1 - square_jaccard_distances
print(jaccard_similiarity_array)
[[1. 0.]
 [0. 1.]]
In [ ]:
distance_df = pd.DataFrame(jaccard_similiarity_array, index = jaccard_df['name'], columns=jaccard_df['name'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[83], line 1
----> 1 distance_df = pd.DataFrame(jaccard_similiarity_array, index = jaccard_df['name'], columns=jaccard_df['name'])

File c:\Users\kenar\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py:827, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    816         mgr = dict_to_mgr(
    817             # error: Item "ndarray" of "Union[ndarray, Series, Index]" has no
    818             # attribute "name"
   (...)
    824             copy=_copy,
    825         )
    826     else:
--> 827         mgr = ndarray_to_mgr(
    828             data,
    829             index,
    830             columns,
    831             dtype=dtype,
    832             copy=copy,
    833             typ=manager,
    834         )
    836 # For data is list-like, or Iterable (will consume into list)
    837 elif is_list_like(data):

File c:\Users\kenar\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\internals\construction.py:336, in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
    331 # _prep_ndarraylike ensures that values.ndim == 2 at this point
    332 index, columns = _get_axes(
    333     values.shape[0], values.shape[1], index=index, columns=columns
    334 )
--> 336 _check_values_indices_shape_match(values, index, columns)
    338 if typ == "array":
    339     if issubclass(values.dtype.type, str):

File c:\Users\kenar\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\internals\construction.py:420, in _check_values_indices_shape_match(values, index, columns)
    418 passed = values.shape
    419 implied = (len(index), len(columns))
--> 420 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")

ValueError: Shape of passed values is (2, 2), indices imply (146872, 146872)
In [ ]:
distance_df.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[92], line 1
----> 1 distance_df.head()

NameError: name 'distance_df' is not defined
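
The shape error most likely comes from `square_jaccard_distances` still holding a 2 × 2 result from an earlier run on two rows, while `jaccard_df['name']` has 146872 entries. Because that cell failed, `distance_df` was never created, which explains the NameError. One way to keep the matrix and its labels in sync is to take a subset once and derive both from it. The sketch below uses a hypothetical stand-in dataframe (`demo_df`) with made-up names and columns.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical stand-in for jaccard_df: a 'name' column plus binary columns
demo_df = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C", "Game D"],
    "Racing": [1, 1, 0, 0],
    "Strategy": [0, 1, 1, 0],
    "Sport": [0, 0, 1, 1],
})

# Take the subset once, then derive both the matrix and the labels from it
sample_df = demo_df.head(3).reset_index(drop=True)
features = sample_df.drop(columns="name").to_numpy(dtype=bool)
similarity = 1 - squareform(pdist(features, metric="jaccard"))

# The matrix is (len(sample_df), len(sample_df)), so the labels fit
similarity_df = pd.DataFrame(
    similarity, index=sample_df["name"], columns=sample_df["name"]
)
print(similarity_df.loc["Game A"].sort_values(ascending=False))
```

On the real data the same pattern would only work for a sample small enough to fit in memory, so it is a way to test the pipeline rather than a full solution.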

The next steps to consider are:

1. Continue working on the model so that it shows similar games
2. Possibly add other features such as 'plays' and 'wishlists'
3. Find a way to work around the occasional memory errors
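
For the third point, one possible direction is to skip the full pairwise matrix and compute the similarities for a single selected game only when they are needed. This keeps memory proportional to the number of games instead of its square. The sketch below is an assumption about how that could look, not the notebook's current method; `top_k_jaccard` and the small `features` array are made up for the illustration.

```python
import numpy as np

def top_k_jaccard(features: np.ndarray, query_idx: int, k: int = 5):
    """Return (indices, similarities) of the k rows most similar to row query_idx."""
    X = features.astype(bool)
    q = X[query_idx]
    # Count shared flags and flags set in either row, for every row at once
    intersection = (X & q).sum(axis=1)
    union = (X | q).sum(axis=1)
    # Rows with an empty union (no flags on either side) get similarity 0
    sims = np.divide(intersection, union, out=np.zeros(len(X)), where=union > 0)
    sims[query_idx] = -1.0  # exclude the selected game itself
    order = np.argsort(-sims, kind="stable")[:k]
    return order, sims[order]

# Made-up binary feature matrix: 4 games, 4 flags
features = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
])

idx, sims = top_k_jaccard(features, query_idx=0, k=2)
print(idx, sims)
```

With the real data, `features` would be `jaccard_df.drop(columns='name').values` and the returned indices could be mapped back to `jaccard_df['name']`.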